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By Daniel a M. Witten 
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In recent years, advances in high throughput sequencing teclmol- 
ogy have led to a need for specialized methods for the analysis of 
digital gene expression data. While gene expression data measured 
on a microarray take on continuous values and can be modeled using 
the normal distribution, RNA sequencing data involve nonnegative 
counts and are more appropriately modeled using a discrete count 
distribution, such as the Poisson or the negative binomial. Conse- 
quently, analytic tools that assume a Gaussian distribution (such as 
classification methods based on linear discriminant analysis and clus- 
tering methods that use Euclidean distance) may not perform as well 
for sequencing data as methods that are based upon a more appro- 
priate distribution. Here, we propose new approaches for performing 
classification and clustering of observations on the basis of sequencing 
data. Using a Poisson log linear model, we develop an analog of di- 
agonal linear discriminant analysis that is appropriate for sequencing 
data. We also propose an approach for clustering sequencing data us- 
ing a new dissimilarity measure that is based upon the Poisson model. 
We demonstrate the performances of these approaches in a simulation 
study, on three publicly available RNA sequencing data sets, and on 
a publicly available chromatin immunoprecipitation sequencing data 
set. 

1. Introduction. 

1.1. An overview of RNA sequencing data. Since the late 1990s, a vast 
literature has been devoted to quantifying the extent to which different 
tissue types, biological conditions, and disease states are characterized by 
particular patterns of gene expression, or mRNA levels [examples include 
DeRisi, Iyer and Brown (1997), Spellman et al. (1998), Brown and Botstein 
(1999), Ramaswamy et al. (2001), Nielsen et al. (2002), Monti et al. (2005)]. 
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During most of that time, the microarray has been the method of choice for 
quantifying gene expression. Though the microarray has led to an improved 
understanding of many cellular processes and disease states, the technology 
suffers from two fundamental limitations: 

(1) Cross- hybridization can occur, whereby cDNA hybridizes to a probe 
for which it is not perfectly matched. This can lead to high levels of back- 
ground noise. 

(2) Only transcripts for which a probe is present on the array can be 
measured. Therefore, it is not possible to discover novel mRNAs in a typical 
microarray experiment. 

In recent years, high throughput or second generation RNA sequencing 
has emerged as a powerful alternative to the microarray for measuring gene 
expression [see, e.g., Mortazavi et al. (2008), Nagalakshmi et al. (2008), Wil- 
helm and Landry (2009), Wang, Gerstein and Snyder (2009), Pepke, Wold 
and Mortazavi (2009)]. This technology allows for the parallel sequencing of 
a large number of mRNA transcripts. Briefly, RNA sequencing proceeds as 
follows [Mortazavi et al. (2008), Morozova, Hirst and Marra (2009), Wang, 
Gerstein and Snyder (2009), Auer and Doerge (2010), Oshlack, Robinson 
and Young (2010)]: 

(1) RNA is isolated and fragmented to an average length of 200 nu- 
cleotides. 

(2) The RNA fragments are converted into cDNA. 

(3) The cDNA is sequenced. 

This process results in millions of short reads, between 25 and 300 base- 
pairs in length, usually taken from one end of the cDNA fragments (though 
some technologies result in "paired-end" reads) . The reads are typically then 
mapped to the genome or transcriptome if a suitable reference genome or 
transcriptome is available; if not, then de novo assembly may be required 
[Oshlack, Robinson and Young (2010)]. The mapped reads can then be 
pooled into regions of interest. For instance, reads may be pooled by gene 
or by exon, in which case the data consist of nonnegative counts indicating 
the number of reads observed for each gene or each exon. In this paper we 
will assume that mapping and pooling of the raw reads has already been 
performed. We will consider RNA data sets that take the form of n x p ma- 
trices, where n indicates the number of samples for which sequencing was 
performed, and p indicates the number of regions of interest (referred to as 
"features"). The element of the data matrix indicates the number of 
reads from the ith sample that mapped to the jth region of interest. Sequenc- 
ing data are generally very high dimensional, in the sense that the number of 
features p is much larger than the number of observations n. Specifically, p 
is usually on the order of tens of thousands, if not much larger. 
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RNA sequencing has some major advantages over the microarray. RNA 
sequencing data should in theory be much less noisy than microarray data, 
since the technology does not suffer from cross-hybridization. Moreover, 
novel transcripts and coding regions can be discovered using RNA sequenc- 
ing, since unlike studies performed using microarrays, sequencing experi- 
ments do not require pre-specification of the transcripts of interest. For 
these reasons, it seems certain that RNA sequencing is on track to replace 
the microarray as the technology of choice for the characterization of gene 
expression. 

1.2. Statistical models for RNA sequencing data. Two aspects of se- 
quencing data are especially worth noting, as they result in unique statistical 
challenges. (1) Due to artifacts of the sequencing experiment, different sam- 
ples can have vastly different total numbers of sequence reads. This issue is 
generally addressed by normalizing the samples in some way, for instance, 
by the total number of reads observed for each sample [Mortazavi et al. 
(2008)] or a more robust alternative [BuUard et al. (2010), Robinson and 
Oshlack (2010), Anders and Huber (2010)]. Simply dividing the counts for 
a given sample by a normalization constant may not be desirable, since the 
magnitude of the counts may contain information about the variability in 
the data. (2) Since a sequencing data set consists of the number of reads 
mapping to a particular region of interest in a particular sample, the data 
are integer-valued and nonnegative. This is in contrast to microarray data, 
which is measured on a continuous scale and can reasonably be modeled 
using a Gaussian distribution. 

Let X denote a. n x p matrix of sequencing data, with n observations 
(e.g., tissue samples) and p features (regions of interest; e.g., genes or ex- 
ons). Xij is the count for feature j in observation i. For instance, if feature j 
is a gene, then Xij is the total number of reads mapping to gene j in obser- 
vation i. A number of authors have considered a Poisson log linear model 
for sequencing data, 

(1) Xij ~ Poisson(A'ij), Nij = SiQj 

[among others, Marioni et al. (2008), Bullard et al. (2010), Witten et al. 
(2010), Li et al. (2011)]. To avoid identifiability issues, one can require 
^"^^ Sj = 1. This model allows for variability in both the total number of 
reads per sample (via the Si term) and in the total number of reads per 
region of interest (via the gj term). Since biological replicates seem to be 
overdispersed relative to the Poisson model, some authors have proposed an 
extension to (1) involving the use of a negative binomial model, a natural 
alternative to the Poisson model that allows for the variance to exceed the 
mean [among others, Robinson, McCarthy and Smyth (2010), Anders and 
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Huber (2010)]. Specifically, one could extend (1) to obtain 

(2) Xij ~ NB {Nij Nij = SiQj , 

where NB indicates the negative binomial distribution and > is the 
dispersion parameter for feature j. Throughout this paper, the negative 
binomial distribution will be parametrized such that (2) implies that obser- 
vation Xij has mean Nij and variance Nij+Nl-(j)j. When (j)j = 0, (2) reduces 
to (1). 

RNA sequencing experiments are often designed such that the n obser- 
vations are drawn from K different biological conditions, or classes. To ac- 
commodate this setting, a number of authors have extended (1) and (2) as 
follows: 

(3) Xij\yi = kr^Foisson{Nijdkj), Nij = Sigj, 

(4) Xij\yi = kr^ ]SSB{Nijdkj,(l)j), Nij = SiQj, 

where yi indicates the class of the zth observation, yi G {l,...,ii'}. Here 
the dij, . . . ,dKj terms allow the jth feature to be differentially expressed 
between classes. [However, as written in (3) and (4), the precise roles of 
dij,. . . , dxj are difficult to interpret because the model is overparametrized. 
We will address this point in Section 2.1.] The models (3) and (4) have 
been used to identify features that are differentially expressed between con- 
ditions [Marioni et al. (2008), Bullard et al. (2010), Witten et al. (2010), 
Li et al. (2011), Robinson, McCarthy and Smyth (2010), Anders and Huber 
(2010)]. 

Though the question of how best to identify differentially expressed fea- 
tures has now been extensively studied, it is just one of many possible sci- 
entific questions that may arise from sequencing data. This paper addresses 
the following two problems: 

(1) If each sample is associated with a class label, then one might wish to 
build a classifier in order to predict the class label of a future observation. 

(2) If the samples are unlabeled, one might wish to cluster the samples 
in order to identify subgroups among them. 

Most of the methods in the statistical literature for classification and cluster- 
ing implicitly assume a normal distribution for the data. In this paper, due 
to the nature of sequencing data, the Poisson log linear models (1) and (3) 
will be used to accomplish these two tasks. The importance of the model 
used can be seen on a simple toy example. We generated two-dimensional 
Poisson distributed random variables with two different means, each repre- 
senting a different class. Each dimension was generated independently. The 
Bayes-optimal decision boundaries obtained assuming normality and assum- 
ing a Poisson distribution are shown in Figure 1. 
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Fig. 1. Two sets of two-dimensional independent random variables were generated. The 
first set of random variables was generated according to the Poisson(lO) distribution in 
each dimension, and the second set was generated according to the Poisson(28) distribu- 
tion in each dimension. In each figure, the two sets of random variables are shown as 
black and grey circles, after jittering. The grid in the background of each plot indicates 
the Bayes-optimal decision boundary, assuming a normal distribution (left) or a Poisson 
distribution (right) with the correct mean and variance. 

1.3. Notation and organization. The following additional notation will 
be used in this paper. Let Xj = [Xn ■ ■ ■ Xip)^ denote row i of X, correspond- 
ing to the feature measurements for observation i. Also, X.j = Y17=i-^ij^ 
Xi. = X^j=i-'^ij, X.. = "^ijXij. Moreover, in the classification setting where 
each observation belongs to one of K classes, we let Cfc C {1, . . . , n} contain 
the indices of the observations in class k — that is, yi = /c if and only if i G C^. 
Furthermore, Xc^j = Y^iac^ ^ij- 

The rest of this paper is organized as follows. In Section 2 the model (3) is 
presented in greater detail, along with methods for fitting it that are based 
upon recent proposals in the literature. This model is used as the basis for 
the classifier proposed in Section 3 and as the basis for the clustering method 
proposed in Section 4. In Section 5 the performances of the classification and 
clustering proposals are evaluated in a simulation study. The proposals are 
applied to four sequencing data sets in Section 6. Section 7 contains the 
Discussion. 

2. A Poisson log linear model for multiple-class sequencing data. 

2.1. The model. In this paper sequencing data are modeled using a Pois- 
son log linear model. However, the proposals in this paper could be extended 
to the negative binomial model using techniques developed in Robinson, Mc- 
Carthy and Smyth (2010) and Anders and Huber (2010). 

The model (1) captures the fact that sequencing data are characterized 
by high levels of variation in both the number of counts per sample (sj) and 
the number of counts per feature (gj), and (3) additionally allows for the 
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level of expression of a given feature to depend upon the condition under 
which it is observed {dij, . . . , dKj)- Throughout this paper, we assume that 
the Xjj's are independent of each other for all i = 1, . . . ,n, j = 1, . . . ,p. 

We first consider the problem of fitting the model (1). The maximum like- 
lihood estimate (MLE) for Nij is Nij = [Agresti (2002)]. Combining 
this with the identifiability constraint that Y^l^i Si = l yields the estimates 
Si = Xi./X.. and gj = X.j. We can interpret Sj as an estimate of the size 
factor for sample i, reflecting the fact that different samples may have been 
sequenced to different depths. A number of authors have used this size fac- 
tor estimate [Marioni et al. (2008), Mortazavi et al. (2008)]. Recently, it 
has been pointed out that Xi./X.. is not a very good estimate for Sj since 
changes in a few high-count features can have a great effect on the value 
of Xi., skewing any resulting analyses [Bullard et al. (2010), Robinson and 
Oshlack (2010), Anders and Ruber (2010), Li et al. (2011)]. For this reason, 
several more robust estimates for the size factor Sj have been proposed. In 
what follows we will consider three estimates for Sj: 

(1) Total count. We simply use Si = Xi./X.., the total count for the ith 
observation, which is based upon the MLE for Nij under the model (1). 

(2) Median ratio. Anders and Ruber (2010) propose the use of Si = mi/ 
I]r=i "^i' where 



That is, the size factor for the ith sample is obtained by computing the 
median, over all p features, of the ith sample count for that feature divided 
by the geometric mean of all sample counts for that feature. 

(3) Quantile. Bullard et al. (2010) propose taking Si = qi/Ylll=i Qi^ where qi 
is the 75th percentile of the counts for each sample. 

Thus, throughout this paper, we will estimate Nij in (1) according to Nij = 
SiQj, where Sj is given by one of the methods described above, and gj = X.j. 

We now consider the problem of fitting the model (3). Since we would like 
to attribute as much as possible of the observed variability in the counts for 
each feature to sample and feature effects (sj and gj) rather than to class 
differences (d^j), we estimate Nij under the model (1) without making use 
of the class labels. We then estimate d^j by treating Nij as an offset in the 
model (3). That is, we fit the model 

(6) Xij\yi = A; ~ Poisson(iVjj(ifcj). 

Maximum likelihood provides a natural way to estimate dkj in (6), yielding 
'^kj = v^~~^v~- Now dkj has a simple interpretation: if d^j > 1, then the 
jth feature is over-expressed relative to the baseline in the kth class, and if 
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dkj < 1, then the jth feature is under-expressed relative to the basehne in 
the kth class. 

However, if Xci^j = (an event that is not unlikely if the true mean for 
feature j is small), then the maximum likelihood estimate for d^j equals 
zero. This can pose a problem for downstream analyses, since this esti- 
mate completely precludes the possibility of a nonzero count for feature j 
arising from an observation in class k. We can remedy this problem by 
putting a Gamma(/3,/3) prior on di^j in the model (6). Here, the shape and 
rate parameters both equal (3. Then, the posterior distribution for d^j is 
Gamma(Xc^,j + /?, J2i£Ck ~'~ posterior mean is 

f7) d = ^'^"^ ^ ^ 

Equation (7) is a smoothed estimate of d^j that behaves well even if Xc,^j = 
for some class k. We took /? = 1 in all of the examples shown in this paper. 

2.2. A transformation for overdispersed data. A number of authors have 
observed that, in practice, biological replicates of sequencing data tend to 
be overdispersed relative to the Poisson model, in the sense that the vari- 
ance is larger than the mean. This problem could be addressed by using 
a different model for the data, such as a negative binomial model [Robinson, 
McCarthy and Smyth (2010), Anders and Huber (2010)]. Instead, we apply 
a power transformation to the data [Witten et al. (2010), Li et al. (2011)]. 
The transformation X'^j ^ Xf- is used, where a G (0, 1] is chosen so that 

1=1 j=l J 

This is simply a test of the goodness of fit of the model (1) to the data 
[Agresti (2002)], using the total count size factor estimate Si = Xl/X'.. 

Though the resulting transformed data are not integer- valued, we nonethe- 
less model them using the Poisson distribution. This simple transformation 
allows us to use a Poisson model even in the case of overdispersed data, in 
order to avoid having to fit a necessarily more complicated negative binomial 
model (a task made especially complicated in the typical setting for sequenc- 
ing data where the number of samples is small). In Section 5 we show that 
even when data are generated according to a negative binomial model with 
moderate overdispersion, the classification and clustering proposals based 
on the Poisson model perform well on the transformed data. 

3. A proposal for classifying sequencing data. 

3.1. The Poisson linear discriminant analysis classifier. Suppose that 
we wish to classify a test observation x* = (X^ ■ ■ ■ X*)"^ on the basis of 
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training data {(xj, yi)}"^]^. Let y* denote the unknown class label. By Bayes' 
rule, 



where fk is the density of an observation in class k and vrfc is the prior 
probability that an observation belongs to class k. If fk is a normal den- 
sity with a class-specific mean and common covariance, then assigning an 
observation to the class for which (9) is largest results in standard LDA 
[for a reference, see Hastie, Tibshirani and Friedman (2009)]. If we instead 
assume that the observations are normally distributed with a class-specific 
mean and a common diagonal covariance matrix, then diagonal LDA results 
[Dudoit, Fridlyand and Speed (2001)]. The assumption of normality is not 
appropriate for sequencing data, and neither is the assumption of a common 
covariance matrix for the K classes. We instead assume that the data arise 
from the model (3), and we also assume that the features are independent. 
The assumption of independence is often made for high-dimensional contin- 
uous data [e.g., see Dudoit, Fridlyand and Speed (2001), Tibshirani et al. 
(2002, 2003), Bickel and Levina (2004), Witten and Tibshirani (2011)] since 
when p> n, there are too few observations available to be able to effectively 
estimate the dependence structure among the features. 

Evaluating (9) requires estimation of /fc(x*) and vTfc. The model (3) states 
that X*\y* = /c ~ Poisson{s* gjdkj) ■ We first estimate si, . . . ,Sn, the size fac- 
tors for the training data, using the total count, quantile, or median ratio 
approaches (Section 2.1). We then estimate gj and dkj by evaluating gj = X.j 
and (7) on the training data. Finally, we estimate s* as follows: 

• If the total count estimate for the size factors was used, then s* = X]j=i / 
X.., where X.. is the total number of counts on the training data. 

• If the median ratio estimate for the size factors was used, then s* = 

tn* / '^"'.mi, where m* = medianj{-r™ — ^ \i/n } — ^ote that the denom- 
inator is the geometric mean for the jth feature among the training ob- 
servations. Here mj is given by (5). 

• If the quantile estimate for the size factors was used, then s* = q* / X^^Li 1i- 
Here, q* is the 75th percentile of counts for the test observation, and qi is 
the 75th percentile of counts for the ith training observation. 

Note that these estimates of s* are the direct extensions of the size factor 
estimates presented in Section 2.1, applied to the test observation x*. 

We now consider the problem of estimating vr^ . We could let vri = • • • = 
-kk = ^/K-, corresponding to the prior that all classes are equally likely. 
Alternatively, we could let vr^ = |Cfc|/n, if we believe that the proportion 
of observations in each class seen in the training set is representative of the 
proportion in the population. In the examples presented in Sections 5 and 6, 
we take tti = • • • = ttk = 1/A'. 



(9) 
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Plugging these estimates into (3) and recalling our assumption of inde- 
pendent features, (9) yields 



(10) 



log P{y* = k\x*)= logfkix*) + logTTfc + c 



^ X* log dkj - s* ^ 9jdkj + log TTfc + c', 
i=i i=i 



where c and c' are constants that do not depend on the class label. Only 
the first term in (10) involves the individual feature measurements for the 
test observation x*. Therefore, the classification rule that assigns the test 
observation to the class for which (10) is largest is linear in x*. For this 
reason, we call this classifier Poisson linear discriminant analysis (PLDA). 
This name refiects the linearity of the classifier, as well as the fact that it 
differs from standard LDA only in its use of a Poisson model for the data. 



3.2. The sparse PLDA classifier. PLDA's classification rule (10) is quite 
simple, in that it is linear in the components of x*. But when the estimate (7) 
is used for d^j, then d^j 7^ 1 in general and so the classification rule (10) in- 
volves all p features. For sequencing data, p may be quite large, and a classi- 
fier that involves only a subset of the features is desirable in order to achieve 
increased interpretability and reduced variance. By inspection, the classifi- 



cation rule (10) will not involve the data for feature j if d 



: dxj 



1. 



We obtain a classification rule that is sparse in the features by using the 
following estimate for d^j in (10), which shrinks the estimate (7) toward 1: 



(11) 



dkj 



a 
b 



_P_ 

Vb' 
a p 

b^Vl' 



1, 



if Vb 
iiVb 
ilVh 



1 >/0, 



a 
a 



>P, 



<P, 



where a = Xc^j + (3 and b = XlieCfe ~^ f^- Here, p is a nonnegative tuning 
parameter that is generally chosen by cross-validation. When p = 0, (11) is 
simply the estimate (7). As p increases, so does the number of estimates (11) 
that are exactly equal to 1. Using the estimate (11) in the PLDA classi- 
fier (10) yields sparse PLDA (sPLDA). The operation (11) can be written 
more concisely as dkj = 1 + S{a/b — l, p/ Vb) , where S is the soft-thresholding 
operator, given by S{x,a) = sign(a;)(|x| — a)+. Note that the form of (11) 
combined with the definition of b implies that if class k contains few obser- 
vations, or if the mean for class k in feature j is small, then the estimate 
for dkj will undergo greater shrinkage. 
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Sparse PLDA is closely related to the nearest shrunken centroids (NSC) 
classifier [Tibshirani et al. (2002, 2003)], which is a variant of diagonal LDA 
that arises from shrinking the class-specific mean vectors toward a common 
mean using the soft-thresholding operator. In fact, sPLDA arises from re- 
placing the normal model that leads to NSC with the Poisson model (3). 
For this reason, NSC is a natural method against which to compare sPLDA. 

4. A proposal for clustering sequencing data. 

4.1. Poisson dissimilarity. We now consider the problem of comput- 
ing a n X n dissimilarity matrix for n observations for which sequencing 
measurements are available. For microarray data, squared Euclidean dis- 
tance is a common choice of dissimilarity measure. Another popular choice, 
correlation-based distance, is equivalent to squared Euclidean distance up 
to a scaling of the observations [Hastie, Tibshirani and Friedman (2009)]. 
Squared Euclidean distance can be derived as the consequence of perform- 
ing hypothesis testing on a simple Gaussian model for the data. That is, 
consider the model 

(12) Xij ~ N{^iij,a^), Xi>j ~ N{fii,j,a'^), 

where we assume that the features and observations are independent. Con- 
sider testing the null hypothesis Hq : fiij = fj.i'j,j = 1, . . . ,p, against Ha, which 
states that /ijj and fii'j are unrestricted. The resulting log likelihood ratio 
statistic is proportional to 

j=l V / j=i V / j=i 

^''^ -11 - 11^ 

Therefore, squared Euclidean distance is equivalent to a log likelihood ra- 
tio statistic for each pair of observations, under a Gaussian model for the 
data. 

Now, as discussed earlier, the model (12) does not seem appropriate for 
sequencing data. Instead, consider the model 

Xij ~ Poisson{Nijdij), Xi/j ~ Foissoiii{Niijdi'j), 

(14) 

Nij = Sigj, Ni'j = Si'gj, 

where we assume that the features are independent. This is simply the 
model (3), restricted to Xj and Xj/. We first estimate Nij,Niij under the 
simpler model 

Xij ~ Poisson(A^jj), X^'j ~ Poisson(A^j/j), 

(15) 

Nij = Sigj, Ni>j = Si'gj 
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as described in Section 2.1 — using total count, quantile, or median ratio size 
factor estimates — but restricted to Xj and Xj/ . We then test the nuh hypoth- 
esis Hq : dij = diij = 1, J = 1, . . . ,p, against the alternative Ha, which states 
that dij and di/j are nonnegative. The resulting log likelihood ratio statistic 
can be used as a measure of dissimilarity between Xj and Xj/ . A standard log 
likelihood ratio statistic would involve computing the maximum likelihood 
estimates for dij and diij under Ha- However, to avoid the estimate dij =0 
if Xij = 0, we instead compute a modified log likelihood ratio statistic: we 
evaluate the log likelihood under Ha using the estimates 

(16) dii = ^-^ , diJi = ^r^ , 

^ ' ' Nij+^' iV,,,+/3' 

which are the posterior means for dij and diij under Gamma(/3,/3) priors. 
The resulting modified log likelihood ratio statistic is 

V 

(17) ^(A^ij + A^i'j - ^ijdij - Ni/jdi'j + Xij log dij + Xi>j log di'j). 
i=i 

Then (17) can be thought of as the dissimilarity between Xj and Xj/ under 
the model (3). The dissimilarity between two identical observations is 0, and 
all dissimilarities are nonnegative (see the Appendix). We will refer to the 
n X n dissimilarity matrix with element given by (17) as the Poisson 
dissimilarity matrix. 

Hierarchical clustering is a very popular approach for clustering in ge- 
nomics since it leads to a visual representation of the data and does not 
require prespecifying the number of clusters. Hierarchical clustering oper- 
ates on a n X n dissimilarity matrix, and can be performed using the Pois- 
son dissimilarity matrix. We will refer to the clustering obtained using this 
dissimilarity matrix as Poisson clustering. 

To obtain a px p Poisson dissimilarity matrix of the features rather than 
a n X n dissimilarity matrix of the observations, one could use a Poisson 
model and repeat the arguments in this section, reversing the roles of ob- 
servations and features in each of the relevant equations. One could then 
use this dissimilarity matrix in order to perform Poisson clustering of the 
features. 



4.2. Alternative approaches for clustering count data. We briefly review 
some approaches from the literature for computing dissimilarity matrices or 
performing clustering using count data. 

Serial analysis of gene expression (SAGE) is a sequencing-based method 
for gene expression profiling that predates RNA sequencing [Wang (2007)] . 
In SAGE a prespecified region of the RNA transcript is sequenced, and 
so the ability to detect previously unknown RNA transcripts is somewhat 
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limited. Cai et al. (2004) propose a procedure for performing K-means clus- 
tering of SAGE data using a Poisson model. Though their approach focuses 
on clustering features (known as tags in the context of SAGE data) rather 
than observations, their approach is fundamentally very similar to the one 
proposed here in that a Poisson model is used and deviations from the Pois- 
son model (measured using a chi-squared statistic or the log likelihood) are 
taken as an indication that two tags are different from each other and hence 
belong in different clusters. They propose to fit the Poisson model using the 
maximum likelihood parameter estimates, which is analogous to using total 
count size factor estimates in our model. They fit the Poisson model using 
all n observations at once rather than separately for each pair of observa- 
tions as in (15). In our experience, fitting the model separately for each pair 
of observations leads to better results. Their approach yields a prespecified 
number of clusters K, whereas ours yields a dissimilarity matrix that can be 
hierarchically clustered to obtain any number of clusters, or used for other 
purposes. 

Berninger et al. (2008) propose a method for computing a dissimilarity 
matrix using sequencing data that is also very closely related to ours. They 
assume that each observation is drawn from a multinomial distribution, and 
they test whether or not the multinomial parameters for each pair of obser- 
vations are equal. This is almost identical to our Poisson model and associ- 
ated hypothesis testing framework, since if the observations are distributed 
according to (14), then their distribution conditional on Xj., Xj/. is multino- 
mial. In fact, the log likelihood ratio statistics under our model and theirs 
are identical for certain very natural estimates of Nij, Ni/j, dij, and di'j 
in (17) (see the Appendix). However, there are some important differences 
between the two proposals. Berninger et al. (2008) place a Dirichlet prior 
on the parameters for the multinomial distribution, and then use a Bayes 
factor as a measure of the dissimilarity between two observations. Conse- 
quently, two identical observations can have nonzero dissimilarity according 
to Berninger et al. (2008), and two different observations can have smaller 
dissimilarity than two identical observations. This leads to problems in the 
interpretation of their dissimilarity measure as well as in the performance 
of any clustering approach that is based upon it. Finally, their approach 
can suffer from numerical issues where the computed dissimilarity between 
a pair of observations rounds to zero. 

The edgeR software package, available from Bioconductor [Robinson, 
McCarthy and Smyth (2010)], provides a tool for measuring the dissimilar- 
ity between a pair of observations based upon a negative binomial model. 
The q features with highest feature- wise dispersion across all n samples are 
selected, and the common dispersion of those q features for each pair of 
observations is used as a measure of pairwise dissimilarity. 
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Method 



Table 1 

Summary of approaches for computing dissimilarity measures 

Description 



EdgeR 



Berninger 



VST 



Sq. Euclidean total count 
Sq. Euclidean quantile 

Sq. Euclidean median ratio 

Poisson total count 

Poisson quantile 

Poisson median ratio 



A proposal in the edgeR software package, based on a neg- 
ative binomial model [Robinson, McCarthy and Smyth 
(2010)]. The dissimilarity between a pair of observations is 
computed as the common dispersion of the 500 features with 
the highest feature-wise dispersion across all n samples. 

An approach for computing dissimilarities between pairs of 
observations, using a multinomial model with a Dirichlet 
prior [Berninger et al. (2008)]. 

A variance stabilizing transformation (VST) based on a neg- 
ative binomial model [Anders and Huber (2010)] is applied 
to the data, and then squared Euclidean distances between 
pairs of observations are computed. 

Squared Euclidean distances are computed after scaling each 
sample by the total count size factor estimate (Section 2.1). 

Squared Euclidean distances are computed after scaling each 
sample by the quantile size factor estimate [Section 2.1; 
BuUard et al. (2010)]. 

Squared Euclidean distances are computed after scaling each 
sample by the median ratio size factor estimate [Section 2.1; 
Anders and Huber (2010)]. 

The data are transformed as in Section 2.2. Then Poisson 
dissimilarity is computed according to (17) using the total 
count size factor estimate (Section 2.1). 

The data are transformed as in Section 2.2. Then Poisson 
dissimilarity is computed according to (17) using the quan- 
tile size factor estimate [Section 2.1; BuUard et al. (2010)]. 

The data are transformed as in Section 2.2. Then Poisson 
dissimilarity is computed according to (17) using the median 
ratio size factor estimate [Section 2.1; Anders and Huber 
(2010)]. 



Finally, Anders and Huber (2010) propose a variance-stabilizing trans- 
formation based on the negative binomial model, and suggest performing 
standard clustering procedures on the transformed data — for instance, one 
could perform hierarchical clustering after computing the squared Euclidean 
distances between the transformed observations. 

In Sections 5 and 6 we compare our clustering proposal to these competing 
approaches. Details of the approaches used in the comparisons are given in 
Table 1. 
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5. A simulation study. 

5.1. Simulation setup. We generated data under the model 
(18) Xij\yi = k r^NB{sigjdkj,(j)), 

where (p is the dispersion parameter for the negative binomial distribution. 
Therefore, given that yi = k, Xij has mean SiQjdkj and variance SiQjdkj + 
{sigjdkj)'^(p- We tried three values of : </> = 0.01 (very slight overdispersion) , 
<j) = 0.1 (substantial overdispersion), and </) = 1 (very high overdispersion). 
There were K = 3 classes. The size factors are independent and identically 
distributed, Si ~ Unif (0.2, 2.2). The gj^s are independent and identically dis- 
tributed, gj ~ Exp(l/25). Each of the p = 10,000 features had a 30% chance 
of being differentially expressed between classes. If a feature was not differ- 
entially expressed, then dij = d2j = d^j = 1. If a feature was differentially 
expressed, then log dkj = Zkj where the Zkj's are independent and identically 
distributed, z^j ~ A^(0,c7^). The value of a depended on the simulation and 
is specified in Tables 2 and 3 below. 

5.2. Evaluation of sparse PLDA. We considered three classifiers: 

(1) NSC [Tibshirani et al. (2002, 2003)] after dividing each observation 
by a size factor estimate. 

(2) NSC after dividing each observation by a size factor estimate, and 
then transforming as follows: X'-- ^ sj X^ + 3/8. This transformation should 
make Poisson random variables have approximately constant variance [Ans- 
combe (1948)]. 

(3) sPLDA after performing the power transformation described in Sec- 
tion 2.2. 

For each classifier, three size factor estimates were used: total count, quan- 
tile, and median ratio. These are described in Section 2.1. Therefore, a total 
of nine classification methods were considered. Results are shown in Table 2. 
sPLDA performs quite well when there is little overdispersion relative to the 
Poisson model — that is, when the dispersion parameter, in the model (18) 
is small. The performance of sPLDA relative to NSC deteriorates when the 
data are very overdispersed relative to the Poisson model. Moreover, the 
square root transformation on the whole seemed to lead to substantially 
worse results for the NSC classifier. 

Interestingly, the choice of size factor estimate (total count, quantile, or 
median ratio) seems to have little effect on the classifiers' performances, 
despite the fact that the choice of estimate has been found to play a crit- 
ical role in the detection of differentially expressed features [Bullard et al. 
(2010), Robinson and Oshlack (2010), Anders and Huber (2010)]. This is 
likely because the size factor estimates yield very different results primarily 
in the setting where a subset of the features containing a large proportion 



Table 2 

Simulation results: nine classification methods. NSC, NSC on \/Xij + 3/8, and sPLDA were performed, using three different size factor 
estimates: total count (TC), quantile (Q), and median ratio (MR). Cross-validation was performed on a training set of n observations, 
and error rates were computed on n test observations. We report the mean numbers of test errors and nonzero features over 50 

simulated data sets. Standard errors are m parentheses 



n 


</) 


a 


Method 


NSC 


err. 


NSC sqrt err. 


sPLDA err. 


NSC nonzero 


NSC sqrt 


nonzero 


sPLDA 


nonzero 


12 


0.01 


0.05 


TC 


4.18 


(0.34) 


5.74 


(0.28) 


2.24 


(0.26) 


1947.6 


(441.3) 


2217.9 


(509.0) 


791.4 


(111.7) 








Q 


4.38 


(0.34) 


5.82 


(0.26) 


2.26 


(0.25) 


1670.6 


(394.5) 


2010.1 


(478.2) 


782.3 


(110.0) 








MR 


4.28 


(0.34) 


5.78 


(0.27) 


2.20 


(0.24) 


1731.8 


(402.6) 


2327.8 


(517.9) 


795.4 


(110.8) 


50 


0.01 


0.025 


TC 


19.14 


(0.67) 


24.06 


(0.70) 


16.84 


(0.55) 


2316.6 


(398.8) 


3122.5 


(516.6) 


1830.7 


(217.9) 








Q 


20.32 


(0.71) 


24.82 


(0.70) 


17.14 


(0.56) 


1870.7 


(335.2) 


3380.9 


(519.4) 


1860.2 


(229.6) 








MR 


19.66 


(0.69) 


24.48 


(0.69) 


16.88 


(0.60) 


2488.7 


(437.8) 


2698.7 


(513.4) 


1934.9 


(224.5) 


12 


0.1 


0.1 


TC 


2.52 


(0.31) 


2.66 


(0.26) 


1.58 


(0.25) 


5143.2 


(527.2) 


2738.5 


(461.2) 


3878.2 


(369.4) 








Q 


2.12 


(0.27) 


2.68 


(0.26) 


1.62 


(0.26) 


5207.0 


(536.8) 


2879.8 


(456.7) 


3927.2 


(371.7) 








MR 


2.28 


(0.29) 


2.88 


(0.28) 


1.60 


(0.26) 


4849.4 


(531.2) 


2932.0 


(477.3) 


3889.2 


(368.6) 


50 


0.1 


0.05 


TC 


16.80 


(0.54) 


17.76 


(0.61) 


17.94 


(0.70) 


3802.5 


(408.1) 


3785.5 


(418.1) 


3308.5 


(355.1) 








Q 


17.08 


(0.64) 


17.16 


(0.60) 


17.88 


(0.65) 


4293.2 


(479.1) 


3921.5 


(371.8) 


3284.0 


(352.8) 








MR 


16.78 


(0.59) 


17.34 


(0.66) 


17.96 


(0.71) 


3475.3 


(392.4) 


4398.3 


(457.0) 


3489.3 


(371.9) 


12 


1 


0.2 


TC 


3.24 


(0.25) 


4.28 


(0.35) 


4.26 


(0.32) 


8846.2 


(380.7) 


6127.8 


(524.6) 


4502.0 


(509.9) 








Q 


3.20 


(0.23) 


4.04 


(0.33) 


4.08 


(0.29) 


8991.8 


(318.8) 


6342.1 


(557.6) 


4551.7 


(512.1) 








MR 


3.22 


(0.26) 


3.60 


(0.30) 


4.00 


(0.31) 


8389.0 


(396.4) 


7082.7 


(515.5) 


4518.9 


(514.6) 


50 


1 


0.1 


TC 


25.56 


(0.61) 


25.80 


(0.55) 


25.66 


(0.50) 


4237.8 


(503.5) 


4293.5 


(495.9) 


3150.5 


(433.0) 








Q 


25.82 


(0.61) 


25.90 


(0.64) 


26.02 


(0.55) 


4629.1 


(516.2) 


4170.5 


(491.7) 


3131.2 


(406.6) 








MR 


25.92 


(0.68) 


25.86 


(0.59) 


25.52 


(0.51) 


4427.5 


(524.0) 


4362.6 


(498.0) 


3156.8 


(410.4) 
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Table 3 

Simulation results: ten clustering methods. Complete-linkage hierarchical clustering was 
applied to the nine dissimilarity measures described in Table 1, and each dendrogram was 

cut in order to obtain three clusters. The proposal of Cai et al. (2004) <^lso included; 
this required specifying K = 3. Mean CERs over 50 simulated data sets are reported, with 
standard errors in parentheses. The Poisson-based measures perform quite well when 
overdispersion is low, but tend to be outperformed by EdgeR in the presence of 
substantial overdispersion 



(p CT 


iVltJlIlOtl 


Clustering error rate 


0.01 0.15 


Cai 


U.ooyz 


lC\ AAVI \ 




Berninger 


U.0/U4 


/A Al '70\ 

(U.Oi 16) 




EdgeR 


U.UUUU 


/A AAAA\ 
(U.OOOOj 




VST 


U.DZUi 


I r\ AAOA\ 

(U.UUza ) 




Squared Euclidean total count 


A K^J^K 

U.oo(o 


1 C\ A 1 Al \ 

(u.uiyi ) 




Squared Euclidean quantile 


0.5662 


(0.0215) 




Squared Euclidean median ratio 


A K'TC; cc 

U.0(00 


/A Al '7Q^ 

(U.Ui lo] 




Poisson total count 


0.0045 


(0.0045) 




Poisson quantile 


0.0057 


(0.0047) 




Poisson median ratio 


0.0045 


(0.0045) 


0.1 0.2 


Cai 


0.3803 


(0.0058) 




Berninger 


0.1905 


(0.0258) 




EdgeR 


0.0000 


(0.0000) 




VST 


0.6204 


(0.0029) 




Squared Euclidean total count 


0.3051 


(0.0327) 




Squared Euclidean quantile 


0.2875 


(0.0325) 




Squared Euclidean median ratio 


0.3297 


(0.0350) 




Poisson total count 


0.2053 


(0.0225) 




Poisson quantile 


0.2067 


(0.0228) 




Poisson median ratio 


0.2006 


(0.0219) 


1 0.5 


Cai 


0.3797 


(0.0063) 




Berninger 


0.5309 


(0.0143) 




EdgeR 


0.0098 


(0.0054) 




VST 


0.6058 


(0.0089) 




Squared Euclidean total count 


0.1630 


(0.0242) 




Squared Euclidean quantile 


0.2190 


(0.0235) 




Squared Euclidean median ratio 


0.1998 


(0.0305) 




Poisson total count 


0.2699 


(0.0255) 




Poisson quantile 


0.2699 


(0.0255) 




Poisson median ratio 


0.2749 


(0.0254) 



of the counts are highly differentially expressed. However, in such a setting, 
classification tends to be quite easy. In more challenging classification set- 
tings in which differentially expressed features have fewer counts and display 
smaller differences between classes, such as in Table 2, the effect of the size 
factor estimate appears to play a less important role. 
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5.3. Evaluation of Poisson clustering. We compare the performances of 
ten clustering proposals: the iC-means clustering proposal of Cai et al. (2004) 
which assumes a Poisson model, as well as complete-linkage hierarchical 
clustering applied to the nine dissimilarity measures described in Section 4.2. 
Cai et al.'s (2004) proposal was performed with K = 3 (the true number of 
clusters) , and the hierarchical clustering dendrograms for the other methods 
were cut at a height that resulted in three clusters. 

To evaluate the performances of these clustering methods, we use the 
clustering error rate (CER), which measures the extent to which two parti- 
tions P and Q of a set of n observations disagree. Let Ip^i^i') be an indicator 
for whether observations i and i' are in the same group in partition P, and 
define lQ(i,i') analogously. Then CER is defined as 



This is also one minus the Rand Index [Rand (1971)]. We took P to be 
the true class labels and Q to be the class labels estimated via clustering; 
a small value indicates an accurate clustering. 

Simulation study results with n = 25 observations are shown in Table 3. 
The Poisson clustering proposed in this paper performs well for the full 
range of overdispersion parameters considered. This is in part because the 
transformation described in Section 2.2 makes the data approximately Pois- 
son even when the overdispersion parameter <j) is large. Even though the 
method of Berninger et al. (2008) is based on a model that is very similar 
to ours, it exhibits worse performance. This is likely due to numerical issues 
with their proposal whereby two different observations can have zero dis- 
similarity and two identical observations can have nonzero dissimilarity. It 
is difficult to compare the Poisson-based proposal of Cai et al. (2004) directly 
to the other nine since it uses a K-means approach, where the number of 
clusters must be specified in advance. Moreover, the proposals of Berninger 
et al. (2008) and Cai et al. (2004) do not entail first performing a power 
transformation on the data. 

In Table 3 the EdgeR dissimilarity measure exhibited essentially the same 
performance as our Poisson clustering measure when (p = 0.01, and better 
performance in the presence of moderate or severe overdispersion. However, 
it is quite computationally intensive. The example shown in Table 3 con- 
tains only 25 observations because running EdgeR using the Bioconductor 
package provided by the authors [Robinson, McCarthy and Smyth (2010)] 
is too slow for larger values of re. For instance, on a simulated example with 
re = 50 and p = 10,000, it took 6 minutes to compute the dissimilarity ma- 
trix on a AMD Opteron 848 2.20 GHz processor. In contrast, computing the 
Poisson dissimilarity matrix on the same example took 14 seconds. 



(19) 
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In summary, the Poisson dissimilarity measure outperforms all of the 
methods besides EdgeR. EdgeR is the overall winner, but is much more 
computationally demanding. 

6. Application to sequencing data sets. 

6.1. Data sets. We present results based on four data sets. The first three 
are RNA sequencing data sets, and the fourth is a chromatin immunopre- 
cipitation (ChIP) sequencing data set intended as a preliminary assessment 
of the extent to which the methods proposed here can be applied to other 
types of sequencing data. 

Liver and kidney. An RNA sequencing data set quantifying the expres- 
sion of 22,925 genes [Marioni et al. (2008)]. There are seven technical repli- 
cates from a liver sample and seven technical replicates from a kidney sam- 
ple, each from a single human male. The liver and kidney samples are treated 
as two separate classes. The data are available as a Supplementary File as- 
sociated with Marioni et al. (2008). 

Yeast. An RNA sequencing data set consisting of replicates of Saccha- 
romyces cerevisiae (yeast) cultures [Nagalakshmi et al. (2008)]. Three repli- 
cates were obtained for each of two library preparation protocols, "random 
hexamer" (RH) and "oligo(dT)" (dT). For each library preparation proto- 
col, there is an "original" replicate, a "technical" replicate of that original 
replicate, and also a "biological" replicate. The number of reads mapping 
to each of 6,874 genes is available as a Supplementary File associated with 
Anders and Huber (2010). In the analysis that follows, we treat the RH and 
dT library preparations as two distinct classes. 

Cervical cancer. An RNA sequencing data set quantifying the expression 
of microRNAs in tumor and nontumor human cervical tissue samples [Wit- 
ten et al. (2010)]. MicroRNAs are smah RNAs, 18-30 nucleotides in length, 
that have been shown to play an important regulatory role in a number 
of biological processes. The data take the form of 29 tumor and 29 non- 
tumor cervical tissue samples with measurements on 714 microRNAs. Of 
the tumor samples, 6 are adenocarcinomas (ADC), 21 are squamous cell 
carcinomas (SCC), and 2 are unclassified. The two unclassified samples are 
excluded from the analysis. Normal, ADC, and SCC are treated as three sep- 
arate classes. The data are available from Gene Expression Omnibus [Barrett 
et al. (2005)] under accession number GSE20592. Since this is a small RNA 
data set, the experimental protocol differs slightly from the description in 
Section 1.1: small RNAs were isolated before being converted to cDNA, 
which was then amplified and sequenced. As pointed out by a reviewer, Lin- 
sen et al. (2009) found that small RNA digital gene expression profiling is 
biased toward certain small RNAs, and so small RNA sequencing data sets 
cannot be used to accurately determine absolute numbers of small RNAs. 
However, this bias is systematic and reproducible, and so small RNA se- 
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quencing data sets can be used to determine relative expression differences 
between samples. The classification and clustering proposals in this paper 
rely on relative rather than absolute expression differences in the sense that 
accurate classification or clustering can be performed even if certain small 
RNAs contain a disproportionately large number of counts relative to the 
true abundance in the original sample. 

Transcription factor binding. ChIP sequencing is a new approach for map- 
ping protein-DNA interactions at a genome-wide level that relies upon re- 
cently developed techniques for high throughput DNA sequencing [Johnson 
et al. (2007)] . Like RNA sequencing, the results of a ChIP sequencing exper- 
iment can be arranged as a n x p matrix with n observations and p features. 
The features represent the DNA binding regions for a protein of interest, 
and the (i, j) element of the data matrix indicates the number of times that 
the protein was observed to bind to the jth binding region in the ith. sam- 
ple. In Kasowski et al. (2010), the binding sites of RNA polymerase II were 
mapped in each of ten individuals. 19,061 binding regions were identified, 
each of which was treated as a distinct feature. At least three replicates were 
available for each individual, and there were 39 observations in total. This 
data are available as a Supplementary File associated with Anders and Hu- 
ber (2010). In what follows, we treat each of the ten individuals as a distinct 
class. 

6.2. Evaluation of sparse PLDA. A total of nine classification methods 
were compared: NSC, NSC on y^Xjj + 3/8, and sPLDA, each with three dif- 
ferent size factor estimates. Details are given in Section 5.2. These methods 
were applied to the four data sets described in Section 6.1. Results on the 
cervical cancer and transcription factor binding data sets are shown in Fig- 
ure 2. Results for the liver and kidney data and the yeast data are not shown 
since on those two data sets, all methods gave cross-validation errors for 
all of the tuning parameter values considered. 

6.3. Evaluation of Poisson clustering. We clustered the observations in 
each of the four data sets described in Section 6.1. Eight dissimilarity mea- 
sures were used to perform complete-linkage hierarchical clustering (Ta- 
ble 1). The liver and kidney data resulted in a perfect clustering by all 
methods of comparison (results not shown). The cervical cancer, yeast, and 
transcription factor binding results are shown in Figures 3, 4 and 5. The cer- 
vical cancer data are challenging: the Poisson dissimilarity measures are best 
able to distinguish between tumor and nontumor samples, but no method 
is able to convincingly distinguish between ADC and SCC (Figure 3). For 
the yeast data, all methods but EdgeR and VST yield essentially the same 
dendrogram — one RH sample appears to be distinct from all the other sam- 
ples, but the remaining RH and dT samples are quite distinct. For that 
data, EdgeR and VST yield different (and presumably worse) clusterings. 
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Fig. 2. A comparison of classification methods on the cervical cancer data of Witten et al. 
(2010) and the transcription factor binding data of Kasowski et al. (2010). NSC, NSC on 
the square root transformed data, and sPLDA were performed each with three distinct size 
factor estimates — total count, quantile [Bullard et al. (2010)], and median ratio [Anders 
and Ruber (2010)], each of which is described m Section 2.1. Five-fold cross-validation 
was performed. The resulting cross-validation error curves are shown as a function of the 
number of features included in the classifier. Results for the yeast data of Nagalakshmi 
et al. (2008) and the liver and kidney data of Marioni et al. (2008) are not shown because 
all methods gave cross-validation errors for all of the tuning parameter values considered. 



EdgeR Sq. Euclidean Total Count Sq. Euclidean Quantile 5q. Euclidean iUledian Ratio 




VST Poisson Total Count Polsson Quantile Poisson Median Ratio 




Fig. 3. Complete-linkage hierarchical clustering dendrograms obtained by computing 
eight dissimilarity measures on the cervical cancer data of Witten et al. (2010). The dis- 
similarity measures are described in Table 1. Normal samples are shown as red solid lines, 
ADC as blue dotted lines, and SCC as green dashed lines. The Poisson-based approaches 
seem to do the best job of separating the normal samples from the tumor samples. 



The transcription factor binding data are the most complex since there are 
ten groups (one per individual). There is no clear winner, but EdgeR seems 
to perform quite well (Figure 5). The CERs for each dendrogram can be 
found in Figure 6. 
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EdgeR Sq. Euclidean Total Count 5q. Euclidean Quantile Sq. Euclidean l\/ledian Ratio 




VST Poisson Total Count Polsson Quantile Polsson Median Ratio 




Fig. 4. Complete-linkage hierarchical clustering dendrograms obtained by computing 
eight dissimilarity measures on the yeast data of Nagalakshmi et al. (2008). The dis- 
similarity measures are described in Table 1. The dT samples are shown m red and RH 
samples are in green. The three methods based on Euclidean distance and the three methods 
based on Poisson dissimilarity give the most accurate dendrograms on this data, since they 
successfully group the dT samples together. 



EdgeR Sq. Euclidean Total Count Sq. Euclidean Quantile Sq. Euclidean IVIedlan Ratio 




VST Polsson Total Count Poisson Quantile Polsson Median Ratio 




Fig. 5. Complete-linkage hierarchical clustering dendrograms obtained by computing 
eight dissimilarity measures on the transcription factor binding data of Kasowski et al. 
(2010). The dissimilarity measures are described in Table 1. Replicates from each of the 
ten individuals are shown in a different color. 



7. Discussion. In this paper we have proposed approaches for the classifi- 
cation and clustering of sequencing data. As sequencing technologies become 
increasingly widespread, the importance of statistical methods that are well 
suited to this type of data will increase. The approaches proposed in this 
paper were designed for RNA sequencing data, but could likely be extended 
to other sequencing technologies such as DNA and ChIP sequencing. In fact, 
an application to ChIP sequencing data was presented in Section 6. 
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Poisson Median Ratio 
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Number of Clusters 



Number of Clusters 



Fig. 6. For each of the eight dissimilarity measures described in Table 1, the figure 
displays the CERs that result from cutting the complete linkage hierarchical clustering 
dendrogram at various cutpomts. 



The methods proposed in this paper follow naturally from a simple Pois- 
son log linear model (3) for sequencing data. Similar approaches could be 
taken using an alternative model, for instance, one based on the negative 
binomial distribution. The methods proposed seem to work very well if the 
true model for the data is Poisson or if there is mild overdispersion relative to 
the Poisson model. Performance degrades in the presence of severe overdis- 
persion. Most sequencing data seem to be somewhat overdispersed relative 
to the Poisson model. It may be that extending the approaches proposed 
here to the negative binomial model could result in improved performance 
in the presence of overdispersion. 

A number of authors have proposed detecting differentially expressed fea- 
tures in sequencing data by making inference on a Poisson log linear model 
[see, e.g., Marioni et al. (2008), Bullard et al. (2010), Li et al. (2011)]. In 
this paper we have used such a model to develop proposals for classification 
and clustering. However, many other types of inference based on sequencing 
data are likely to be of interest in the future. For instance, in a recent pa- 
per, Lee, Huang and Hu (2010) proposed a method for performing principal 
components analysis (PCA) for high-dimensional binary data. In a similar 
vein, one could develop an approach for PCA for sequencing data using the 
Poisson log linear model (1). We leave this as a topic for further research. 

It has been shown that the manner in which samples are normalized is of 
great importance in identifying differentially expressed features on the ba- 
sis of sequencing data [Bullard et al. (2010), Robinson and Oshlack (2010), 
Anders and Huber (2010)]. However, in Sections 5 and 6, the normaliza- 
tion approach appeared to have little effect on the results obtained. This 
seems to be due to the fact that the choice of normalization approach is 
most important when a few features with very high counts are differen- 
tially expressed between classes. In that case, identification of differentially 
expressed features can be challenging, but classification and clustering are 
quite straightforward. 
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It is known that RNA-sequencing data suffer from transcript length bias — 
that is, longer transcripts tend to result in a greater number of reads, result- 
ing in an increased tendency to call such transcripts differentially expressed 
[Oshlack and Wakefield (2009)]. In a similar manner, the classification and 
clustering proposals made in this paper are affected by the total number of 
counts per feature; this can be seen by inspection of (7), (11) and (17). It 
seems clear that bias due to the total number of counts per feature is unde- 
sirable for the task of identifying differentially expressed transcripts, since 
it makes it difficult to detect differential expression for low-frequency tran- 
scripts. However, it is not clear that such a bias is undesirable in the case of 
classification or clustering, since we would like features about which we have 
the most information — namely, the features with the highest total counts — 
to have the greatest effect on the classifiers and dissimilarity measures that 
we use. More investigation into this matter is left as a topic for future work. 

Our proposal for clustering sequencing data is based on the development 
of a dissimilarity measure that is potentially more appropriate for count 
data than standard Euclidean distance. The resulting dissimilarity matrix 
can then be input to a standard clustering algorithm, such as hierarchical 
clustering. Other statistical techniques that rely upon a dissimilarity matrix, 
such as multidimensional scaling, could also be performed using the Poisson 
dissimilarity measure developed here. 

An R language package implementing the methods proposed in this paper 
will be made available. 

APPENDIX A: PROPERTIES OF THE POISSON 
DISSIMILARITY MEASURE 

We wish to prove that the dissimilarity between Xj and Xj/ specified in (17) 
is nonnegative, and that it equals zero when Xj = Xj/ . 

To prove nonnegativity, first notice that g{dij) = —Nijdij + Xijlogdij is 

a concave function of djj, and is maximized when dij = Therefore, 

diicT') ^ 5(1)- And since -^p — - is between 1 and concavity of g en- 

sures that g{-^ — ■£) > 5(1); that is, Nij — Nijdij + Xij log dij > 0. It follows 

directly that (17) is nonnegative. 

Now, if Xj = Xj', then Nij = Ni/j = Xij = Xi'j and so by (16), dij = di/j = 1. 
By inspection, (17) equals zero. 

APPENDIX B: EQUIVALENCE OF LOG LIKELIHOOD RATIO 
STATISTICS UNDER POISSON MODEL AND 
MULTINOMIAL MODEL 

Here, we show that the log likelihood ratio statistic (17) under the Poisson 
model is identical to the log likelihood ratio statistic under the model of 
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Berninger et al. (2008), if appropriate estimates of Nij, Ni'j, dij, and diij 
are used in (17). 

In Section 4.1 we assumed that the ith and i'th observations take the 
form Xij ~ Poisson{Ni j di j), Xiij ~ Poisson{Niijdi/j), Nij = Sigj,Niij = Si'gj. 
Under the nuh hypothesis, dij = di/j = 1. Under the alternative, dij and di/j 
are unconstrained. Suppose we estimate Nij and Ni'j under the null using 
the MLEs, or, equivalently, using the total count size factor estimates given 
in Section 2.1: then Nij = Xi.{Xij + Xi,j)/{Xi. + Xj/.), Ni'j = Xi,.{Xij + 
Xi'j)/{Xi. + Xif.). Treating these estimates as offsets under the alternative, 
the MLE for dij is Xij /Nij, and the MLE for di'j is Xij/Niij. Plugging these 
estimates into (17) yields 
p 

"^^{^i-i^ij + ^i'j)/ {^i- + ^i') + Xii.{Xij + Xiij)/ {Xi. + Xii.) 

i=i 

- Xij - Xi,j + Xij log{X,j /Nij) + Xi,j \og{Xi,j/Ni,j)) 

p 

(20) = ^{X,j logiXij/N,j) + Xi,j log{X,, j/Ni,j)) 
i=i 

p 

= '^{X,j logXij + Xi'j log Xi,j - {Xij + Xi'j) log{Xij + Xi,j)) 
i=i 

+ {Xi. + Xi,.)log{X,. + Xi,.) - Xi. logXi. - X,,. logXi,.. 

Now, Berninger et al. (2008) instead assume a multinomial model for the 
data: Xii,...,Xip ~ Multinomial(Xj., gi, . . . , grp) and Xi'i, . . . , Xiip ~ 
Multinomial(Xj'., ri, . . . , r^). Under the null, qj = Vj Under the alterna- 
tive, Qj and rj are unconstrained. This results in the likelihood ratio statistic 

^ (X,,/X,.)^-(X,,,/X,,^^'' 



^^^^ n {{x,j + xvj)/{x,. + x,,.))^-+^-'. ■ 

Taking the logarithm of (21) yields (20). 

Note that, in practice, the dissimilarity measures proposed in Section 4.1 
and in Berninger et al. (2008) are not identical, since in Section 4.1 we 
estimate dij and di/j as the posterior means using a Gamma prior. Berninger 
et al. (2008) instead use a Dirichlet prior on qi,...,qp and ri,...,rp and 
use the Bayes factor as the dissimilarity measure. In fact, the proposal of 
Berninger et al. (2008) seems to perform substantially worse than that of 
Section 4.1 in the simulation study in Section 5. 
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